n-step return
Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns
Tian, Dong, Li, Ge, Zhou, Hongyi, Celik, Onur, Neumann, Gerhard
Unlike traditional methods that evaluate single state-action pairs or apply action chunking only in the actor network, this approach feeds chunked actions directly into the critic. Leveraging the Transformer's strength in processing sequential data, the proposed architecture achieves more robust value estimation. Empirical evaluations demonstrate that this method leads to efficient and stable training, particularly excelling in environments with sparse rewards or multi-phase tasks.
Contributions:
1. We present a novel critic architecture for SAC that leverages Transformers to process sequential information, resulting in more accurate value estimation. Context: Transformer-based critic network.
2. We introduce a method for incorporating n-step returns into the critic network in a stable and efficient manner, effectively mitigating the variance and importance-sampling challenges commonly associated with n-step returns. Context: stable integration of n-step returns.
3. We shift action chunking from the actor to the critic, demonstrating that enhanced temporal reasoning at the critic level, beyond traditional actor-side exploration, drives performance improvements in sparse and multi-phase tasks. Context: unlike previous approaches that focus on actor-side chunking for exploration, our Transformer-based critic network produces a smooth value surface that is highly responsive to dataset variations, eliminating the need for additional exploration enhancements.
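Not part of the abstract, but for concreteness: the n-step target these contributions build on is the standard one, G_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}). A minimal Python sketch (function name and interface are illustrative assumptions, not the paper's code):

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Textbook n-step return target.

    rewards: the n observed rewards r_t .. r_{t+n-1}
    bootstrap_value: critic estimate V(s_{t+n}) used to bootstrap the tail
    """
    target = bootstrap_value
    # Fold the rewards in backwards: target <- r + gamma * target
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

For example, with two rewards [1.0, 0.0], a bootstrap value of 10.0, and gamma = 0.5, the target is 1 + 0.5*0 + 0.25*10 = 3.5.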
TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning
Li, Ge, Tian, Dong, Zhou, Hongyi, Jiang, Xinkai, Lioutikov, Rudolf, Neumann, Gerhard
This work introduces a novel off-policy Reinforcement Learning (RL) algorithm that utilizes a transformer-based architecture to predict the state-action values for a sequence of actions. These value estimates are then used to update a policy that predicts a smooth trajectory instead of a single action at each decision step. Predicting a whole trajectory of actions is commonly done in episodic RL (ERL) (Kober & Peters, 2008) and differs conceptually from conventional step-based RL (SRL) methods like PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018a), where an action is sampled at each time step. The action selection concept in ERL is promising, as shown in recent works in RL (Otto et al., 2022; Li et al., 2024). Similar insights have been made in the field of Imitation Learning, where predicting action sequences instead of single actions has led to great success (Zhao et al., 2023; Reuss et al., 2024). Additionally, decision-making in ERL aligns with human decision-making, where a person generally does not decide at every single time step but rather performs a whole sequence of actions to complete a task - for instance, swinging an arm to play tennis without overthinking each per-step movement. Episodic RL is a distinct family of RL that emphasizes the maximization of returns over entire episodes, typically lasting several seconds, rather than optimizing the intermediate states during environment interactions (Whitley et al., 1993; Igel, 2003; Peters & Schaal, 2008). Unlike SRL, ERL shifts the solution search from per-step actions to a parameterized trajectory space, leveraging techniques like Movement Primitives (MPs) (Schaal, 2006; Paraschos et al., 2013) to generate action sequences. This approach enables a broader exploration horizon (Kober & Peters, 2008), captures temporal and degrees-of-freedom (DoF) correlations (Li et al., 2024), and ensures smooth transitions between re-planning phases (Otto et al., 2023).
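The Movement Primitives mentioned above generate a smooth action sequence as a weighted sum of basis functions. A hypothetical sketch with normalized Gaussian bases (the basis choice, names, and interface are assumptions for illustration, not TOP-ERL's implementation):

```python
import math

def rbf_trajectory(weights, num_steps, width=0.1):
    """MP-style trajectory: tau(t) = sum_i w_i * phi_i(t) with normalized
    Gaussian basis functions phi_i centered evenly on [0, 1]."""
    k = len(weights)
    centers = [i / (k - 1) for i in range(k)]
    traj = []
    for s in range(num_steps):
        t = s / (num_steps - 1)  # normalized time in [0, 1]
        phis = [math.exp(-(t - c) ** 2 / (2 * width ** 2)) for c in centers]
        z = sum(phis)  # normalize so basis weights sum to 1 at each t
        traj.append(sum(w * p / z for w, p in zip(weights, phis)))
    return traj
```

Because the bases are normalized, equal weights yield a constant trajectory, and varying the weights bends the trajectory smoothly rather than step by step.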
Demystifying the Recency Heuristic in Temporal-Difference Learning
Daley, Brett, Machado, Marlos C., White, Martha
The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD($\lambda$), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as $n$-step returns, satisfy a weaker (i.e., non-monotonic) recency heuristic. Why is the recency heuristic effective for temporal credit assignment? What happens when credit is assigned in a way that violates this heuristic? In this paper, we analyze the specific mathematical implications of adopting the recency heuristic in TD learning. We prove that any return estimator satisfying this heuristic: 1) is guaranteed to converge to the correct value function, 2) has a relatively fast contraction rate, and 3) has a long window of effective credit assignment, yet bounded worst-case variance. We also give a counterexample where on-policy, tabular TD methods violating the recency heuristic diverge. Our results offer some of the first theoretical evidence that credit assignment based on the recency heuristic facilitates learning.
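For reference, the exponentially decaying weighting of TD($\lambda$) referred to above is the $\lambda$-return, $(1-\lambda)\sum_{n=1}^{T-1}\lambda^{n-1}G_t^{(n)} + \lambda^{T-1}G_t^{(T)}$; its weights decrease monotonically with $n$, which is exactly the recency heuristic in its strong form. A small Python sketch of this textbook construction (not code from the paper):

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n discounted rewards, then bootstrap on V(s_{t+n}).
    `values` holds one state value per state, with 0.0 at the terminal state."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam):
    """(1-lam) * sum_{n=1}^{T-1} lam^(n-1) * G^(n) + lam^(T-1) * G^(T):
    exponentially decaying weights over n-step returns."""
    T = len(rewards) - t  # steps remaining in the episode
    g = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T))
    return g + lam ** (T - 1) * n_step_return(rewards, values, t, T, gamma)
```

Setting lam=0 recovers the one-step TD target and lam=1 the Monte Carlo return; intermediate values interpolate between them.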
Distributed Multi-Objective Dynamic Offloading Scheduling for Air-Ground Cooperative MEC
Huang, Yang, Dong, Miaomiao, Mao, Yijie, Liu, Wenqiang, Gao, Zhen
Utilizing unmanned aerial vehicles (UAVs) equipped with edge servers to assist terrestrial mobile edge computing (MEC) has attracted tremendous attention. Nevertheless, state-of-the-art schemes based on deterministic optimization or single-objective reinforcement learning (RL) cannot reduce the backlog of task bits and simultaneously improve energy efficiency in highly dynamic network environments, where the design problem amounts to a sequential decision-making problem. To address these problems, as well as the curse of dimensionality introduced by the growing number of terrestrial users, this paper proposes a distributed multi-objective (MO) dynamic trajectory planning and offloading scheduling scheme, integrated with multi-objective RL (MORL) and the kernel method. An n-step return design is also applied to average out fluctuations in the backlog. Numerical results reveal that the n-step return benefits the proposed kernel-based approach, achieving significant improvement in long-term average backlog performance compared to the conventional 1-step return design. Owing to this design and the kernel-based neural network, to which decision-making features can be continuously added, the kernel-based approach outperforms an approach based on a fully connected deep neural network, yielding improvements in energy consumption and backlog performance, as well as a significant reduction in decision-making and online learning time.
Compound Returns Reduce Variance in Reinforcement Learning
Daley, Brett, White, Martha, Machado, Marlos C.
Multistep returns, such as $n$-step returns and $\lambda$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. Variance becomes the limiting factor in the length of multistep returns; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns (weighted averages of $n$-step returns) to reduce variance. We prove for the first time that any compound return with the same contraction modulus as a given $n$-step return has strictly lower variance. We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. Because general compound returns can be expensive to implement, we introduce two-bootstrap returns, which reduce variance while remaining efficient, even when using minibatched experience replay. We conduct experiments showing that two-bootstrap returns can improve the sample efficiency of $n$-step deep RL agents, with little additional computational cost.
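To make the contraction-modulus condition above concrete: a compound return $\sum_i w_i G^{(n_i)}$ has contraction modulus $\sum_i w_i \gamma^{n_i}$, so a two-bootstrap return mixing horizons $n_1$ and $n_2$ can be weighted to match the modulus of a single $n$-step return. A sketch (the helper names are mine, not the paper's):

```python
def contraction_modulus(weights_ns, gamma):
    """Modulus of a compound return sum_i w_i G^(n_i): sum_i w_i * gamma^(n_i)."""
    return sum(w * gamma ** n for w, n in weights_ns)

def two_bootstrap_weight(n1, n2, n, gamma):
    """Weight w on G^(n1) (and 1-w on G^(n2)) so that the two-bootstrap
    mixture matches a single n-step return's contraction modulus:
    w*gamma^n1 + (1-w)*gamma^n2 = gamma^n (requires n1 <= n <= n2)."""
    return (gamma ** n - gamma ** n2) / (gamma ** n1 - gamma ** n2)
```

For instance, with gamma = 0.9, mixing a 1-step and a 10-step return with the weight returned for n = 5 gives the same contraction modulus as a plain 5-step return, while (per the paper's result) having strictly lower variance.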
Revisiting Fundamentals of Experience Replay
Fedus, William, Ramachandran, Prajit, Agarwal, Rishabh, Bengio, Yoshua, Larochelle, Hugo, Rowland, Mark, Dabney, Will
Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay: greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively, we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial, while other techniques confer limited benefit when sifting through larger memory. Separately, by directly controlling the replay ratio, we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
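The uncorrected n-step returns discussed above are typically formed when writing transitions to the replay buffer: n discounted rewards are summed with no off-policy correction, and the bootstrap state is recorded alongside. A rough Python sketch of that bookkeeping (the interface is an illustrative assumption, not the paper's code):

```python
def n_step_transitions(episode, n, gamma):
    """Turn one episode (a list of (state, action, reward) steps) into
    replay tuples (s, a, n-step reward sum, bootstrap state, discount),
    with no importance-sampling correction (the 'uncorrected' variant).
    bootstrap state is None (discount 0) when the episode ends first."""
    out = []
    for t in range(len(episode)):
        horizon = min(n, len(episode) - t)  # truncate at episode end
        g = sum(gamma ** k * episode[t + k][2] for k in range(horizon))
        s, a, _ = episode[t]
        if t + horizon < len(episode):
            out.append((s, a, g, episode[t + horizon][0], gamma ** horizon))
        else:
            out.append((s, a, g, None, 0.0))
    return out
```

A learner then regresses Q(s, a) toward g + discount * max_a' Q(bootstrap_state, a'), exactly as in one-step Q-learning but with the precomputed n-step quantities.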
Multi-Step Reinforcement Learning: A Unifying Algorithm
De Asis, Kristopher, Hernandez-Garcia, J. Fernando, Holland, G. Zacharias, Sutton, Richard S. (University of Alberta)
Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter. Currently, there are a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ) that unifies and generalizes these existing algorithms, while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling), and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically which can result in even greater performance.
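The one-step Q(σ) backup can be sketched directly from the description above: the target interpolates between the sampled next action value (Sarsa) and the expectation under the policy (Expected Sarsa). A minimal Python illustration (not the authors' code):

```python
def q_sigma_target(reward, q_next, pi_next, a_next, sigma, gamma):
    """One-step Q(sigma) backup target:
    r + gamma * (sigma * Q(s', a') + (1 - sigma) * sum_a pi(a|s') Q(s', a)).
    sigma=1 recovers Sarsa (full sampling); sigma=0 recovers Expected Sarsa.

    q_next: action values Q(s', .) at the next state
    pi_next: policy probabilities pi(.|s') at the next state
    a_next: index of the action actually sampled at s'
    """
    expectation = sum(p * q for p, q in zip(pi_next, q_next))
    return reward + gamma * (sigma * q_next[a_next] + (1 - sigma) * expectation)
```

An intermediate sigma (e.g. 0.5) mixes the sampled and expected targets, which is the regime the paper's experiments find to perform best.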
Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning
Sharma, Sahil, J, Girish Raguvir, Ramesh, Srivatsan, Ravindran, Balaraman
Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision-making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: the value function for the current state is moved towards a bootstrapped target estimated using the next state's value function. $\lambda$-returns generalize beyond 1-step returns and strike a balance between Monte Carlo and TD learning methods. While $\lambda$-returns have been extensively studied in RL, they have not been explored much in deep RL. This paper's first contribution is an exhaustive benchmarking of $\lambda$-returns. Although mathematically tractable, the exponentially decaying weighting of $n$-step-return-based targets in $\lambda$-returns is a rather ad hoc design choice. Our second major contribution is a generalization of $\lambda$-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the $n$-step returns in an end-to-end manner. This allows the agent to learn how much to weigh each $n$-step-return-based target. In contrast, $\lambda$-returns restrict RL agents to an exponentially decaying weighting scheme. Autodidactic returns can be used to improve any RL algorithm that uses TD learning. We empirically demonstrate that using sophisticated weighted mixtures of multi-step returns (like CAR and $\lambda$-returns) considerably outperforms the use of $n$-step returns. We perform our experiments on the Asynchronous Advantage Actor-Critic (A3C) algorithm in the Atari 2600 domain.
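As a toy illustration of the idea (not the paper's architecture, which learns the weights end-to-end inside the network): a softmax over confidence scores yields a normalized weighting of n-step return targets, of which the fixed exponential $\lambda$ weighting is the special case being generalized.

```python
import math

def mixed_return(n_step_returns, confidence_logits):
    """Weighted mixture of n-step return targets, with weights produced by a
    softmax over learned confidence logits (one logit per return length)."""
    m = max(confidence_logits)
    exps = [math.exp(x - m) for x in confidence_logits]  # stable softmax
    z = sum(exps)
    return sum(e / z * g for e, g in zip(exps, n_step_returns))
```

Equal logits reduce to a uniform average; a strongly dominant logit concentrates the target on a single n-step return, so the agent can learn anything from uniform mixing to pure n-step bootstrapping.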
Improving Approximate Value Iteration with Complex Returns by Bounding
Wright, Robert William (Air Force Research Laboratory - Information Directorate and Binghamton University), Qiao, Xingye (Binghamton University), Loscalzo, Steven (Air Force Research Laboratory - Information Directorate), Yu, Lei (Binghamton University)
Approximate value iteration (AVI) is a widely used technique in reinforcement learning. Most AVI methods do not take full advantage of the sequential relationship between samples within a trajectory in deriving value estimates, due to the challenges in dealing with the inherent bias and variance in the $n$-step returns. We propose a bounding method which uses a negatively biased but relatively low variance estimator generated from a complex return to provide a lower bound on the observed value of a traditional one-step return estimator. In addition, we develop a new Bounded FQI algorithm, which efficiently incorporates the bounding method into an AVI framework. Experiments show that our method produces more accurate value estimates than existing approaches, resulting in improved policies.
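The bounding rule itself is simple to state: use the negatively biased, low-variance complex-return estimate to clip the one-step target from below. A minimal sketch of that step in isolation (names are illustrative, not the paper's code; the Bounded FQI machinery around it is omitted):

```python
def bounded_targets(one_step_targets, complex_return_bounds):
    """Elementwise bounding: lift each (unbiased but noisy) one-step target
    to at least the negatively biased, lower-variance complex-return
    estimate, which serves as a lower bound on the true value."""
    return [max(o, c) for o, c in zip(one_step_targets, complex_return_bounds)]
```

When the one-step target underestimates (falls below the lower bound), it is replaced; otherwise it passes through unchanged.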